-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-48864: [C++] Support customizing more Zstd parameters #48865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
e60e703 to
730eb94
Compare
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @HuaHuaY . This can be a useful improvement, I've made a bunch of comments below.
730eb94 to
4ab207e
Compare
|
@pitrou Thank you for your detailed review! I've fixed these comments. Please take a look again. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks mostly good! Just a few nits.
| {5, GZipFormat::DEFLATE, -12, false}, | ||
| {-992, GZipFormat::GZIP, 15, false}}; | ||
|
|
||
| template <std::derived_from<arrow::util::CodecOptions> T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat, a concept 😀
| ARROW_ASSIGN_OR_RAISE(auto cctx, CreateCCtx()); | ||
| size_t ret = ZSTD_compressCCtx(cctx.get(), output_buffer, | ||
| static_cast<size_t>(output_buffer_len), input, | ||
| static_cast<size_t>(input_len), compression_level_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny that the compression level is passed a second time here, after we already gave it to ZSTD_CCtx_setParameter. I suppose it doesn't matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Context will also be passed to ZSTDCompressor and it needs compression_level_. We can move auto ret = ZSTD_CCtx_setParameter(cctx.get(), ZSTD_c_compressionLevel, compression_level_) to MakeCompressor though I think the readability will decrease a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After I check the Zstd's document again, I think there is a mistake here. We need to use ZSTD_compress2 instead of ZSTD_compressCCtx because ZSTD_compressCCtx will reset other parameters. Thank you for your good attention!
/*! ZSTD_compressCCtx() :
* Same as ZSTD_compress(), using an explicit ZSTD_CCtx.
* Important : in order to mirror `ZSTD_compress()` behavior,
* this function compresses at the requested compression level,
* __ignoring any other advanced parameter__ .
* If any advanced parameter was set using the advanced API,
* they will all be reset. Only @compressionLevel remains.
*/
ZSTDLIB_API size_t ZSTD_compressCCtx(ZSTD_CCtx* cctx,
void* dst, size_t dstCapacity,
const void* src, size_t srcSize,
int compressionLevel);
/*! ZSTD_compress2() :
* Behave the same as ZSTD_compressCCtx(), but compression parameters are set using the advanced API.
* (note that this entry point doesn't even expose a compression level parameter).
* ZSTD_compress2() always starts a new frame.
* Should cctx hold data from a previously unfinished frame, everything about it is forgotten.
* - Compression parameters are pushed into CCtx before starting compression, using ZSTD_CCtx_set*()
* - The function is always blocking, returns when compression is completed.
* NOTE: Providing `dstCapacity >= ZSTD_compressBound(srcSize)` guarantees that zstd will have
* enough space to successfully compress the data, though it is possible it fails for other reasons.
* @return : compressed size written into `dst` (<= `dstCapacity),
* or an error code if it fails (which can be tested using ZSTD_isError()).
*/
ZSTDLIB_API size_t ZSTD_compress2( ZSTD_CCtx* cctx,
void* dst, size_t dstCapacity,
const void* src, size_t srcSize);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a new test TEST(TestCodecMisc, ZstdLargerWindowLog) to ensure the other parameters work effectively.
dfbca3b to
84dd650
Compare
5d26717 to
698aefd
Compare
wgtmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me (I haven't gone through the test). Just left some minor comments.
| } | ||
|
|
||
| TEST(TestWriterProperties, SetCodecOptions) { | ||
| constexpr int ZSTD_c_windowLog = 101; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use kZstdCWindowLog to be consistent with this code base?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to keep the name style of zstd. Because I think it is more readable, although it is inconsistent with our coding style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we can just keep this in the tests.
| } | ||
|
|
||
| TEST(TestWriterProperties, SetCodecOptions) { | ||
| constexpr int ZSTD_c_windowLog = 101; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we can just keep this in the tests.
20955d7 to
c6c7282
Compare
mapleFU
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General LGTM
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you @HuaHuaY . I'll rebase for CI.
4dd93e4 to
2af1945
Compare
|
@github-actions crossbow submit -g cpp |
|
Revision: 2af1945 Submitted crossbow builds: ursacomputing/crossbow @ actions-512d34afcc |
|
@HuaHuaY The new tests need to be skipped if zstd is not available: |
2af1945 to
a6b30fd
Compare
My mistake. I have added an if statement to skip the tests. @pitrou |
|
@github-actions crossbow submit cppminimal* |
|
Revision: a6b30fd Submitted crossbow builds: ursacomputing/crossbow @ actions-3fe0b907c4
|
a6b30fd to
d5c8568
Compare
|
Sorry that I have refactored a little code. Could you help me take a look and approve this PR again? @pitrou |
|
@github-actions crossbow submit cppminimal* |
|
Revision: d5c8568 Submitted crossbow builds: ursacomputing/crossbow @ actions-c4d50b6264
|
|
Thanks again @HuaHuaY ! |
|
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 3ed9169. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
Support customizing more Zstd parameters to allow users to optimize the performance of parquet page compression/decompression.
What changes are included in this PR?
ZstdCodecOptionsinherited fromCodecOptions.ZSTD_CCtx/ZSTD_DCtxexplicitly inZSTDCodec.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes. The new class
ZstdCodecOptionsis added.